1 Introduction

The field of data science deals with the extraction of knowledge from data. This goal gives rise to a set of interdisciplinary tasks that demand a range of skills in data exploration.
The activities described in this report illustrate how such data science tasks can be implemented in code, based on a simulated business project. For this purpose, data from an online shop was provided. During the project, the import, cleaning and manipulation of the data were implemented. In addition, exploratory analysis, visualization, experiment analysis and machine learning (construction of a model) were covered; their results are described in the following sections of this report. Each task was implemented in either R or Python. The first step, the data cleaning, was implemented in both languages.

The project’s main task is to analyze the given datasets, in particular by creating appropriate visualizations and summary tables for the data that is suspected to be relevant to the online shop. In addition, an experiment analysis has to be carried out that calculates and presents the performance of different recommendation systems. Furthermore, a model should be constructed to predict whether a user of the online shop will place an order. This prediction should be based on the person’s browsing behaviour on the website.

2 Data Description

The provided data comes from an online shop selling beauty products. There are two datasets: one with data about customer orders and another with data about customer clicks on the website.
The orders dataset consists of one row for each customer order in the period from 28 January 2000 to 3 March 2000. The data contains values characterizing the ordered products, the payment methods used, the order process itself and social data, for example the customer’s location or age.
The clicks dataset, in contrast, contains data on customer clicks on the company’s website from 14 April 2000 to 30 April 2000. It mainly comprises information about payment methods, customer attributes and preferences, and product details, as well as values describing each click itself, for example the request time or the current page URL. This data makes it possible to trace the whole course of a customer’s session.
Both datasets share a considerable number of columns. However, since not every click results in an order and a session normally consists of more than one click, the contents differ significantly.

3 Data Manipulation

Before we started cleaning the data, we copied it to a separate folder. This keeps the edited datasets apart from the original ones and avoids accidentally altering the originals. The cleaned data and further derived forms of the datasets were also saved to this directory. For an overview of the folder structure, see the Appendix.

3.1 Cleaning

The cleaning process consists of the following steps:

  1. Copy the original datasets

  2. Read the data from the files

  3. Add headers

  4. Replace “?” and “NULL” with NA

  5. Drop columns with a 100% ratio of missing data

  6. Reformat datetime cells

  7. Save the result

  8. Save a subset of the cleaned data, containing only 1000 rows, to support a quick view into the cleaned data
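Steps 2 to 6 and step 8 of the list above can be sketched in a few lines. This is a minimal illustration in plain standard-library Python, not the project's actual cleaning script: the function name `clean_rows`, the column name `Order ID` and the input datetime format are assumptions made for the example, and `None` stands in for NA. Copying and saving the files (steps 1 and 7) are left out.

```python
import csv
import io
from datetime import datetime

MISSING = {"?", "NULL"}  # markers that are replaced with NA (None here)


def clean_rows(raw_text, headers, datetime_cols=(), in_fmt="%d.%m.%Y %H:%M"):
    """Read headerless CSV text, add headers, mark NAs, drop empty columns,
    and rewrite datetime cells. `in_fmt` is an assumed input format."""
    rows = [dict(zip(headers, rec)) for rec in csv.reader(io.StringIO(raw_text))]
    # Step 4: replace "?" and "NULL" with None.
    for row in rows:
        for col, value in row.items():
            if value in MISSING:
                row[col] = None
    # Step 5: keep only columns that have at least one non-missing value.
    kept = [c for c in headers if any(r[c] is not None for r in rows)]
    # Step 6: rewrite datetime cells as "YYYY-MM-DD HH:MM:SS".
    cleaned = []
    for row in rows:
        out = {c: row[c] for c in kept}
        for c in datetime_cols:
            if c in out and out[c] is not None:
                out[c] = datetime.strptime(out[c], in_fmt).isoformat(sep=" ")
        cleaned.append(out)
    return kept, cleaned


raw = "1,?,28.01.2000 10:15\n2,NULL,NULL\n"
columns, cleaned = clean_rows(
    raw, ["Order ID", "Empty", "Order Date_Time"], ["Order Date_Time"]
)
preview = cleaned[:1000]  # Step 8: small subset for a quick look
```

In the toy input, the column `Empty` contains only missing markers and is therefore dropped, while `Order Date_Time` is kept because one of its two cells is filled.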

A cleaning process along these lines was implemented in both Python and R.

Note: Python code chunks can generally be executed in RMarkdown, but in the preview function the Python environment does not persist across different Python chunks. When the document is knitted, however, the chunks are compiled together.
To execute the provided code, you may need to set your Python path in an R code block using “use_python()”. If this does not work, copy the Python code into the Python IDE of your choice and run it, ideally from the main repository directory “./”.
In addition to the required packages, the software “graphviz” needs to be installed. If any graphviz-related bugs prevent you from running the code, you can use the decision tree figure provided in this report as a reference.

To test whether the cleaning scripts in Python and R produce the same file, we implemented a short script that creates a diff view. The result showed no differences between the cleaned versions of the datasets created with the two languages.
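Such a diff check could look roughly as follows. The sketch uses `difflib.unified_diff` from the Python standard library; the helper name `diff_lines` and the file labels are hypothetical, and the project's own diff script may have worked differently.

```python
import difflib


def diff_lines(python_output, r_output):
    """Return unified-diff lines; an empty list means the texts are identical."""
    return list(
        difflib.unified_diff(
            python_output.splitlines(),
            r_output.splitlines(),
            fromfile="cleaned_python.csv",  # hypothetical label
            tofile="cleaned_r.csv",  # hypothetical label
            lineterm="",
        )
    )


same = diff_lines("a,b\n1,2\n", "a,b\n1,2\n")
changed = diff_lines("a,b\n1,2\n", "a,b\n1,3\n")
```

For identical inputs the result is an empty list, which corresponds to the "no differences" outcome described above.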

3.2 Merging

Note: As we explain in this subchapter, the merged data is too small to serve any useful purpose. This is why we did not implement the merging in R after coding it in Python.

We tried to merge the click and order data in Python using different combinations of ID columns that occur in both datasets. To test the combinations we used an inner join, because it makes it easy to see whether a merge attempt succeeded: an unsuccessful attempt yields zero rows. We tried the following combinations, which resulted in the shapes shown for the merged dataset:

Clicks | Orders | Shape [rows, columns]
Session ID | Order Line Session ID | [0, 438]
Session ID | Order Session ID | [0, 438]
Customer ID | Customer ID | [6906, 437]
Session Cookie ID | Order Line Session ID | [0, 438]
Session Cookie ID | Order Session ID | [0, 438]

In this way we discovered that the datasets can be joined on ‘Customer ID’ for some instances, and we saved the result of this merge as a dataset. However, the merged data does not make much sense, since a customer ID can have multiple order rows and multiple click rows. Another problem is that the time periods of the two datasets do not overlap. Because of these issues, we decided to build a second, smaller subset containing only the customer information columns of both original datasets. The final merged customer dataset contains 80 attributes for 97 customers. Since only such a small share of the data could be merged, we consider the joined dataset rather unimportant and did not perform any further analysis on it.
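The key-testing approach can be illustrated with a small sketch that computes the shape of an inner join for a given key pair. It is written in plain Python on lists of dictionaries; the function name `inner_join_shape` and the toy rows are assumptions for illustration. The column count assumes that a key column with the same name on both sides appears only once in the result, while all other columns are kept from both sides.

```python
def inner_join_shape(clicks, orders, click_key, order_key):
    """Return [rows, columns] of an inner join of two lists of dict rows."""
    by_key = {}
    for order in orders:
        key = order.get(order_key)
        if key is not None:  # missing keys never match
            by_key.setdefault(key, []).append(order)
    # Every click row is paired with every order row that has the same key.
    n_rows = sum(len(by_key.get(click.get(click_key), [])) for click in clicks)
    n_cols = len(clicks[0]) + len(orders[0])
    if click_key == order_key:
        n_cols -= 1  # a shared key column appears only once
    return [n_rows, n_cols]


clicks = [
    {"Session ID": "s1", "Customer ID": "c1"},
    {"Session ID": "s1", "Customer ID": "c1"},
    {"Session ID": "s2", "Customer ID": None},
]
orders = [
    {"Order Session ID": "x9", "Customer ID": "c1"},
    {"Order Session ID": "x8", "Customer ID": "c1"},
]
no_match = inner_join_shape(clicks, orders, "Session ID", "Order Session ID")
by_customer = inner_join_shape(clicks, orders, "Customer ID", "Customer ID")
```

In the toy data, two click rows and two order rows of the same customer produce four joined rows. This multiplication of rows is the reason a join on the customer ID alone is hard to interpret.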

4 Data Analysis

The aim of the data analysis is to extract information that is suspected to be valuable to the online shop and to prepare it in a way that makes it easily “digestible”. The information is presented in summary tables and different kinds of visualizations.

4.1 Missing Data

Before creating overview tables or plots for columns, it makes sense to evaluate which columns actually contain a large amount of information and which do not. To check the ratio of filled cells, we created a ranking for each dataset containing the column names and the percentage of missing data per column. Columns with a low percentage of missing data are then preferred in later analysis steps. To give an impression of the results, the first 20 entries of the two rankings are shown below, side by side.

Order Column | Order NA % | Click Column | Click NA %
Order Line Date | 0 | Request Processing Time | 0.000
Order Line Date_Time | 0 | Request Date | 0.000
Order Line Unit List Price | 0 | Request Date_Time | 0.000
Order Line ID | 0 | Request Sequence | 0.000
Order Line Quantity | 0 | Request Template | 0.000
Order Line Unit Sale Price | 0 | REQUEST_DAY_OF_WEEK | 0.000
Order Line Status | 0 | REQUEST_HOUR_OF_DAY | 0.000
Order Line Tax Amount | 0 | Cookie First Visit Date | 0.000
Order Line Amount | 0 | Cookie First Visit Date_Time | 0.000
Order Line Day of Week | 0 | Session First Request Date | 0.000
Order Line Hour of Day | 0 | Session First Request Date_Time | 0.000
City | 0 | Session Cookie ID | 0.000
US State | 0 | Session ID | 0.000
Account Creation Date | 0 | Session User Agent | 0.000
Account Creation Date_Time | 0 | Session Visit Count | 0.000
Account Status | 0 | Session First Processing Time | 0.000
Customer ID | 0 | Session First Template | 0.000
Order Date | 0 | Session First Request Day of Week | 0.000
Order Date_Time | 0 | Session First Request Hour of Day | 0.000
Order Customer ID | 0 | Session First Content ID | 0.001
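A ranking of this kind can be computed as sketched below. The function name `missing_ranking` and the toy rows are hypothetical, and `None` stands in for NA; the sketch only illustrates the calculation and is not the project's code.

```python
def missing_ranking(rows, columns):
    """Return (column, percent missing) pairs, least missing first."""
    total = len(rows)
    ranking = []
    for col in columns:
        missing = sum(1 for row in rows if row.get(col) is None)
        ranking.append((col, round(100 * missing / total, 3)))
    return sorted(ranking, key=lambda pair: pair[1])


rows = [
    {"Session ID": "s1", "Age": 36},
    {"Session ID": "s2", "Age": None},
    {"Session ID": "s3", "Age": None},
    {"Session ID": "s4", "Age": None},
]
ranking = missing_ranking(rows, ["Age", "Session ID"])
```

Columns at the top of the returned list are the ones with the fewest missing values and are therefore the preferred candidates for later analysis steps.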

4.2 Structure and Content

4.2.1 Order Data

The order data can be divided into four main sections:

  1. Customer Data: For this section we regard all information referring to the customer as an individual. This data contains information such as gender, location, family status and retail activities.
    • City
    • Country
    • US State
    • Age
    • Marital Status
    • Gender
    • Audience
    • Truck Owner
    • RV Owner
    • Motorcycle Owner
    • Working Woman
    • Presence Of Children
    • Speciality Store Retail
    • Oil Retail Activity
    • Bank Retail Activity
    • Finance Retail Activity
    • Miscellaneous Retail Activity
    • Upscale Retail
    • Upscale Speciality Retail
    • Retail Activity
  2. Product Data: The following data columns describe features of the ordered products.
    • StockType
    • Manufacturer
    • BrandName
  3. Payment Data: This section contains columns that describe the payment methods used by customers.
    • Order Credit Card Brand
    • Bank Card Holder
    • Gas Card Holder
    • Upscale Card Holder
    • Unknown Card Type
    • TE Card Holder
    • Premium Card Holder
    • New Bank Card
  4. Order Data: The order data section contains information describing the order process itself, such as order quantity and price data.
    • Order Line Quantity
    • Order Line Unit List Price
    • Order Line Amount
    • Spend Over 12 Per Order On Average
    • Order Line Day of Week
    • Order Line Hour of Day
    • Order Promotion Code
    • Order Discount Amount

4.2.2 Clickstream Data

The clickstream data has three main categories: customer data, product data and time data. The clickstream dataset contains payment methods as well, but the information from these attributes is not considered here, since the payment data is only available for customers who actually ordered. For analysis purposes, only the most important or interesting attributes are discussed in detail. For the full range of columns, see the Appendix.

  1. Customer Data: The data contains a vast collection of information about customers, ranging from common attributes such as age and gender through financial activities to opinions about the shop.
    • City
    • US State
    • Age
    • Marital Status
    • Gender
    • Audience
    • Truck Owner
    • RV Owner
    • Motorcycle Owner
    • Working Woman
    • Presence Of Children
    • Speciality Store Retail
    • Oil Retail Activity
    • Bank Retail Activity
    • Finance Retail Activity
    • Miscellaneous Retail Activity
    • Upscale Retail
    • Upscale Speciality Retail
    • Retail Activity
  2. Product Data: The following data columns describe features of products clicked on by the customers. Similar to the customer data, the product data listing is cut down to the most essential attributes.
    • StockType
    • Manufacturer
    • BrandName
  3. Time Data: The following data columns describe different data referring to the time of a click.
    • Request Date
    • REQUEST_DAY_OF_WEEK
    • REQUEST_HOUR_OF_DAY
    • Session Visit Count

4.3 Summary Tables

Given a subset of interesting columns, we create two types of summary tables: one for the numerical columns in the subset and another for the factors. The summary table for the numerical data contains the maximum, minimum, mean, median and standard deviation of each column. The factor tables contain the five most frequent factor levels with their percentages, as well as the share of all other levels and the share of NAs for each column. Note that for the factor tables the NA percentage is calculated first; the NA values are then removed from the column, and the percentage of each factor level is calculated on the remaining values.

Note: To support a better visualization, the most relevant columns are highlighted in black.
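The two table types can be sketched as follows. The function names `numeric_summary` and `factor_summary` are hypothetical, `None` stands in for NA, and the use of the sample standard deviation is an assumption, since the report does not state which variant was used. The factor function follows the order described above: the NA share is taken over all rows first, and the level shares are then computed on the non-missing values only.

```python
import statistics
from collections import Counter


def numeric_summary(values):
    """Max, mean, median, min and sample SD of the non-missing values."""
    present = [v for v in values if v is not None]
    return {
        "Max": max(present),
        "Mean": round(statistics.mean(present), 2),
        "Median": statistics.median(present),
        "Min": min(present),
        "SD": round(statistics.stdev(present), 2),  # sample SD (assumption)
    }


def factor_summary(values, top=5):
    """Top levels, share of other levels, and NA share for one factor column."""
    # The NA share is computed on all rows first.
    na_pct = round(100 * sum(v is None for v in values) / len(values), 2)
    # Then NAs are dropped and level shares are computed on the rest.
    present = [v for v in values if v is not None]
    counts = Counter(present).most_common()
    top_levels = [(lvl, round(100 * n / len(present), 2)) for lvl, n in counts[:top]]
    others = round(100 * sum(n for _, n in counts[top:]) / len(present), 2)
    return {"Top": top_levels, "Others": others, "Not.Available": na_pct}


age_summary = numeric_summary([18, 36, 54, None])
gender_summary = factor_summary(["Female", "Female", "Male", None])
```

Because the level shares exclude missing values, the top levels and "Others" always add up to 100%, independently of the value in the "Not.Available" column.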

4.3.1 Order Data

The following section shows the summary tables generated to describe the order data. In addition, the most important or interesting results are highlighted and briefly explained. Note that the analysis is based on individual orders. The data of a customer can therefore be counted multiple times in the customer analysis if they ordered more than once in the shop.

4.3.1.1 Customer Data

Only the age can be regarded as a numerical customer data column here. The mean and the median, which both imply an average customer segment consisting of people in their late 30s, appear interesting.
Variable Max Mean Median Min SD
Age 98 38.37 36 18 10.87
The following data summary shows some social data for the shop’s customer segment. Since all available values in the country column are ‘United States’, it is highly probable that the online shop delivers exclusively to customers located in the US. For this reason we chose a map of the United States to visualize the customers’ locations in the plots later on. From this overview we can already see that New York is clearly the leading city. Furthermore, the data clearly shows that the main audience targeted is women.
For an appropriate analysis, some reference values for the US, such as population density or the share of people with children, should be taken into account.
Variable Top.1st Top.2nd Top.3rd Top.4th Top.5th Others Not.Available
City New York: 4.53% San Francisco: 2.05% Stamford: 1.24% Austin: 1.13% Brooklyn: 0.98% 90.07% 0%
Country United States: 100% 0% 2.83%
US State CA: 14.63% NY: 14.11% TX: 6.93% PA: 5.8% CT: 5.28% 53.25% 0%
Marital Status Married: 66.13% Single: 22.02% Inferred Single: 7.15% Inferred Married: 4.7% 0% 34.98%
Gender Female: 83.06% Male: 16.94% 0% 44.96%
Audience Women: 81.17% Men: 12.5% Children: 6.33% 0% 11.08%
Truck Owner False: 78.55% True: 21.45% 0% 22.22%
RV Owner False: 91.5% True: 8.5% 0% 22.22%
Motorcycle Owner False: 98.66% True: 1.34% 0% 22.22%
Working Woman False: 68.79% True: 31.21% 0% 22.22%
Presence Of Children False: 54.66% True: 45.34% 0% 22.22%
Speciality Store Retail False: 84.12% True: 15.88% 0% 22.22%
Oil Retail Activity False: 91.8% True: 8.2% 0% 22.22%
Bank Retail Activity False: 75.44% True: 24.56% 0% 22.22%
Finance Retail Activity False: 91.69% True: 8.31% 0% 22.22%
Miscellaneous Retail Activity False: 94.88% True: 5.12% 0% 22.22%
Upscale Retail False: 94.25% True: 5.75% 0% 22.22%
Upscale Speciality Retail False: 96.44% True: 3.56% 0% 22.22%
Retail Activity False: 60% True: 40% 0% 22.22%

4.3.1.2 Product Data

The selected columns belonging to the product information section contain only factor values. The statistical overview of the product data reveals that most of the sold articles are replenishable. The strongest brand in the current orders is American Essentials, which seems to manufacture its articles itself. The following table should be read as a popularity overview of product attributes.
Variable Top.1st Top.2nd Top.3rd Top.4th Top.5th Others Not.Available
StockType Replenishable: 69.58% Seasonal 1: 23.08% Replenishment: 5.14% Seasonal 1*: 1.83% Seasonal 2: 0.24% 0.13% 14.72%
Manufacturer American Essentials: 20.91% Ridgeview: 16.64% HAN: 13.16% Donna Karan Company: 10.85% HOSO: 10.67% 27.77% 6.15%
BrandName AME: 22.07% HOSO: 11.26% ELT: 10.81% Silk Reflections: 9.35% DAN: 7.92% 38.59% 11.08%

4.3.1.3 Payment Data

The data on the payment methods used also contains only factor columns. By far the most used credit card is VISA. Furthermore, almost a quarter of the customers hold a premium card. By comparing this share of premium cards with the share in the whole population, one could estimate how wealthy the customer segment is.
Variable Top.1st Top.2nd Top.3rd Top.4th Top.5th Others Not.Available
Order Credit Card Brand VISA: 59.94% MC: 25.43% AMEX: 14.31% DISC: 0.31% NA 0.01% 16.71%
Bank Card Holder True: 86.57% False: 13.43% NA 0% 22.22%
Gas Card Holder True: 75.81% False: 24.19% NA 0% 22.22%
Upscale Card Holder True: 54.1% False: 45.9% NA 0% 22.22%
Unknown Card Type False: 56.18% True: 43.82% NA 0% 22.22%
TE Card Holder False: 89.42% True: 10.58% NA 0% 22.22%
Premium Card Holder False: 75.88% True: 24.12% NA 0% 22.22%
New Bank Card False: 99.55% True: 0.45% NA 0% 22.22%

4.3.1.4 Order Process Data

The numerical data for the order process shows that a customer usually buys one product per order. The order line amount also implies that the store offers rather inexpensive articles. Furthermore, the minimum value of both the order line quantity and the order line amount is negative, which suggests that the order data either contains returns as well or contains errors. The discount amount shows that the store offered a price reduction of at most 50% in the given time period, and that the purchased articles were discounted by about 9% on average. Again, this data is probably not representative of the shop’s offer in general, because articles with a higher discount are likely to be bought more often.
Variable Max Mean Median Min SD
Order Line Quantity 18 1.31 1.0 -2 0.95
Order Line Unit List Price 72 9.26 7.5 0 6.46
Order Line Amount 234 11.62 10.0 -40 11.51
Order Line Hour of Day 23 13.04 13.0 0 5.29
Order Discount Amount 50 8.82 10.0 0 9.98
The factors for the order process show that the ‘FRIEND’ discount is used most often, in the majority of the orders. (For the interpretation it would be relevant to know for whom and under which circumstances this discount is given.) The weekday summary could also be relevant for sales purposes, for example for finding the most successful time to show ads to potential customers: from this overview, Wednesday appears to offer a high chance of ad responses.
Variable Top.1st Top.2nd Top.3rd Top.4th Top.5th Others Not.Available
Spend Over 12 Per Order On Average False: 64.04% True: 35.96% 0% 0%
Order Line Day of Week Wednesday: 26.96% Tuesday: 17.72% Thursday: 17.37% Friday: 16.48% Saturday: 8.43% 13.04% 0%
Order Promotion Code FRIEND: 82.09% SPRING: 2.14% MARCH1: 1.92% FREE: 1.39% 4128003160593466: 1.13% 11.33% 23.15%

4.3.2 Clickstream Data

The following tables display only the most important columns from the clickstream dataset. Interesting details are discussed in short texts. For the full range of columns see the Appendix.

4.3.2.1 Customer Data

Since the clickstream dataset contains a row for every click of a customer, the social information for a user gets easily multiplied by the browsing behaviour. Thus, the data has a high probability of being skewed when it comes to customer information. Therefore, we calculated the customer summary based on only one row per session.
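Reducing the click rows to one row per session can be done, for example, by keeping the first row seen for each session ID. The helper name `first_row_per_session` is hypothetical, and the choice of the first row (rather than, say, the last) is an assumption; the report does not state which row was kept.

```python
def first_row_per_session(click_rows, session_key="Session ID"):
    """Keep only the first click row of every session, preserving order."""
    seen = set()
    unique = []
    for row in click_rows:
        if row[session_key] not in seen:
            seen.add(row[session_key])
            unique.append(row)
    return unique


click_rows = [
    {"Session ID": "s1", "Age": 30},
    {"Session ID": "s1", "Age": 30},
    {"Session ID": "s2", "Age": None},
]
sessions = first_row_per_session(click_rows)
```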

The age has a mean of 37.58 and a median of 36, implying a main customer base in their late 30s.
Variable Max Mean Median Min SD
Age 86 37.58 36 18 10.71
The highest ranked cities for the clickstream sessions are San Francisco and New York, each accounting for about 2% of the sessions. Additionally, the gender distribution is clearly dominated by females, strongly suggesting that the shop’s main target group is women.
For all details, see the full customer data table.
Variable Top.1st Top.2nd Top.3rd Top.4th Top.5th Others Not.Available
City San Francisco: 2.31% New York: 2.18% Chicago: 1.35% Stamford: 1.09% Dallas: 0.71% 92.36% 96.92%
US State CA: 13.28% NY: 11.1% TX: 5.97% PA: 5.39% IL: 4.68% 59.58% 96.92%
Marital Status Married: 61.8% Single: 24.7% Inferred Married: 7% Inferred Single: 6.5% 0% 98.02%
Gender Female: 83.33% Male: 16.67% 0% 98.43%
Audience Women: 85.44% Children: 10.27% Men: 4.29% 0% 98.25%
Truck Owner False: 77.84% True: 22.16% 0% 97.72%
RV Owner False: 91.17% True: 8.83% 0% 97.72%
Motorcycle Owner False: 98.79% True: 1.21% 0% 97.72%
Working Woman False: 65.54% True: 34.46% 0% 97.72%
Presence Of Children False: 50.39% True: 49.61% 0% 97.72%
Speciality Store Retail False: 84.59% True: 15.41% 0% 97.72%
Oil Retail Activity False: 90.65% True: 9.35% 0% 97.72%
Bank Retail Activity False: 77.58% True: 22.42% 0% 97.72%
Finance Retail Activity False: 90.3% True: 9.7% 0% 97.72%
Miscellaneous Retail Activity False: 94.37% True: 5.63% 0% 97.72%
Upscale Retail False: 94.37% True: 5.63% 0% 97.72%
Upscale Speciality Retail False: 96.36% True: 3.64% 0% 97.72%
Retail Activity False: 64.33% True: 35.67% 0% 97.72%

4.3.2.2 Product Data

The most common stock type among the viewed products is “replenishable”. The strongest brand among the viewed products is DKNY, which seems to manufacture its articles itself.
For all details, see the full products data table.
Variable Top.1st Top.2nd Top.3rd Top.4th Top.5th Others Not.Available
StockType Replenishable: 60.93% Seasonal 1: 23.75% Seasonal 1*: 11.25% Seasonal 2: 2.07% Replenishment: 1.83% 0.17% 80.3%
Manufacturer Donna Karan Company: 10.79% Peneco: 9.08% HAN: 8.66% Kneipp: 6.61% Paul Lavitt Mills Inc.: 6.54% 58.32% 80.12%
BrandName DKNY: 9.99% Silk Reflections: 9.21% ORO: 8.85% HPK: 7.22% AME: 7.13% 57.6% 86.16%

4.3.2.3 Time Data

The shop registers its highest visitor numbers on Saturdays, which seems reasonable, since people usually have the most time for online shopping during weekends. The single day with the most views in the given time period was 27 April 2000, a Thursday. As there was no major holiday coming up, this high activity may be due to the weather, which normally has a strong influence on online shopping behaviour.
Variable Top.1st Top.2nd Top.3rd Top.4th Top.5th Others Not.Available
Request Date 2000-04-27: 10.35% 2000-04-28: 7.92% 2000-04-17: 7.43% 2000-04-19: 7.38% 2000-04-15: 7.25% 59.67% 0%
REQUEST_DAY_OF_WEEK Saturday: 16.66% Thursday: 15.4% Sunday: 14.97% Tuesday: 13.56% Wednesday: 13.32% 26.09% 0%
The request hour of the day shows a roughly equal distribution of views over the morning and afternoon hours. This implies an unusually high activity in the night and morning hours. The session visit count indicates which visit the current session is for a customer. For example, a session visit count of 10 means that this user is currently in their 10th browsing session on the online shop. The maximum value of 974 seems very high for an observation period of only about half a month. In combination with the unusually high activity during the night and morning, this points to automated bots being active on the website.
Variable Max Mean Median Min SD
REQUEST_HOUR_OF_DAY 23 11.30 11 0 6.2
Session Visit Count 974 13.02 1 1 73.8
Request Date 2000-04-30 NA NA 2000-04-14 NA

4.3.2.4 Calculated Data

Since not all interesting information can be obtained by simply summarising the existing columns, we decided to add calculated columns. The session duration gives the time in minutes a customer spent on the website and is calculated from the timestamps of the first and last click of a session. The click number represents how many clicks a customer made during one session. For both values, the maximum indicates that there could be bot activity on the website.
Variable Max Mean Median Min SD
Session Duration 1439.983 8.88 2.55 0 74.34
Click Number 1842.000 46.93 4.00 1 189.84
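The two calculated columns can be derived as in the following sketch. The function name `session_metrics`, the input layout of (session ID, timestamp) pairs and the ISO-style timestamp format are assumptions for illustration.

```python
from collections import defaultdict
from datetime import datetime


def session_metrics(clicks):
    """Map session id -> (duration in minutes, number of clicks)."""
    stamps = defaultdict(list)
    for session_id, timestamp in clicks:
        stamps[session_id].append(datetime.fromisoformat(timestamp))
    return {
        # Duration: time between the first and the last click of the session.
        sid: ((max(ts) - min(ts)).total_seconds() / 60, len(ts))
        for sid, ts in stamps.items()
    }


metrics = session_metrics(
    [
        ("s1", "2000-04-14 10:00:00"),
        ("s1", "2000-04-14 10:02:33"),
        ("s2", "2000-04-14 11:00:00"),
    ]
)
```

A session with a single click has a duration of 0 minutes under this definition, which matches the minimum of 0 in the table above.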

4.3.3 Comparison

When comparing the product and customer information for the order and click dataset directly, some interesting observations can be made.

4.3.3.1 Customer Data

Age: The summarised age values of the click and order dataset are similar, but not entirely equal. The median is indeed equal for both datasets. However, the mean differs a little and is lower for the click data, indicating that young people might browse more than they buy.
Variable Max Mean Median Min SD
Orders 98 38.37 36 18 10.87
Clicks 86 37.58 36 18 10.71
City: From the following overview it can be observed that especially people from New York tend to order relatively often compared to their browsing amount. But since a lot of the click data for the cities is not available, the comparison is possibly not that meaningful.
Variable Top.1st Top.2nd Top.3rd Top.4th Top.5th Others Not.Available
Orders New York: 4.53% San Francisco: 2.05% Stamford: 1.24% Austin: 1.13% Brooklyn: 0.98% 90.07% 0%
Clicks San Francisco: 2.31% New York: 2.18% Chicago: 1.35% Stamford: 1.09% Dallas: 0.71% 92.36% 96.92%
Audience: The comparison of the audience for both datasets shows that women’s and children’s products are viewed more than they are bought, whereas the opposite holds for men’s products. Since it can be assumed that it is usually women who look for children’s products, this observation suggests that male users are the more goal-oriented buyers. This could be interesting for discount offers, for example, because an indecisive potential customer is more likely to be swayed by such offers than a determined one.
Variable Top.1st Top.2nd Top.3rd Others Not.Available
Orders Women: 81.17% Men: 12.5% Children: 6.33% 0% 11.08%
Clicks Women: 85.44% Children: 10.27% Men: 4.29% 0% 98.25%
Children: From the overview table it can be seen that people with children tend to view less relative to buying than childless users. This effect could be explained by parents having less time. For optimization purposes, this finding should be taken into account: the shop website could meet the needs of parents through increased user comfort and more effective recommendations.
Variable Top.1st Top.2nd Others Not.Available
Orders False: 54.66% True: 45.34% 0% 22.22%
Clicks False: 50.39% True: 49.61% 0% 97.72%

4.3.3.2 Product Data

Stock type: The seasonal share is about 10 percentage points higher in the click data than in the order data, and the replenishable share is lower by roughly the same amount. This indicates a high user interest in seasonal products but a rather low interest in buying them. This could suggest that customers do not really appreciate the prices or the quality of the seasonal products.
Variable Top.1st Top.2nd Top.3rd Top.4th Top.5th Others Not.Available
Orders Replenishable: 69.58% Seasonal 1: 23.08% Replenishment: 5.14% Seasonal 1*: 1.83% Seasonal 2: 0.24% 0.13% 14.72%
Clicks Replenishable: 60.93% Seasonal 1: 23.75% Seasonal 1*: 11.25% Seasonal 2: 2.07% Replenishment: 1.83% 0.17% 80.3%

Manufacturer and Brand: In the click data, Donna Karan leads both the manufacturer and the brand ranking, but is closely followed by its competitors. Also surprising is the weak presence of American Essentials in the click dataset, since it leads both the manufacturer and the brand ranking for the order data. This indicates that American Essentials has a very high ratio of orders to product views.

Manufacturer
Variable Top.1st Top.2nd Top.3rd Top.4th Top.5th Others Not.Available
Orders American Essentials: 20.91% Ridgeview: 16.64% HAN: 13.16% Donna Karan Company: 10.85% HOSO: 10.67% 27.77% 6.15%
Clicks Donna Karan Company: 10.79% Peneco: 9.08% HAN: 8.66% Kneipp: 6.61% Paul Lavitt Mills Inc.: 6.54% 58.32% 80.12%
Brand
Variable Top.1st Top.2nd Top.3rd Top.4th Top.5th Others Not.Available
Orders AME: 22.07% HOSO: 11.26% ELT: 10.81% Silk Reflections: 9.35% DAN: 7.92% 38.59% 11.08%
Clicks DKNY: 9.99% Silk Reflections: 9.21% ORO: 8.85% HPK: 7.22% AME: 7.13% 57.6% 86.16%

4.4 Visualizations

Some information is too complex to be compressed into a single table without becoming confusing, or is simply easier to understand when presented as a plot. The plot types used are time series plots, stacked bar plots, distribution curves, Lorenz curves and maps.

4.4.1 Order Data

4.4.1.1 Customer Data

The customer data in the order dataset can be viewed from two perspectives. One is to use every single order row for the visualizations and thereby create a weighted view of the data, in which customers who have bought more products carry more weight. The other is to display the customer information once per unique customer in the order data and disregard the number of orders a customer made. The following plots show both perspectives.

First, we generated density curves for the attribute age to visualize the distribution of the customer base, differentiated by gender. The curves all show a fast rise in the number of customers for the ages 20 to 40, followed by a slow decrease after a peak at about 35 to 40. When comparing the weighted and the unweighted graph, a slight shift of the curve can be observed. This indicates that older people in general tend to order more. This is especially true for older male customers, as shown by the peak at about 55 in the male curve of the weighted density plot. Additionally, a narrower peak can be observed for the male customers, with its maximum shortly before the age of 40. It should also be kept in mind that the gender plot shows a percentage curve for each gender, while the number of customers differs by gender.

To show the customer locations, we generated a map of the United States showing the cities with the highest customer numbers. From this it can be observed that, on average, the west coast of the US orders the most. The difference between the weighted and the unweighted customer graph shows that customers from large cities like New York or San Francisco tend to have high order numbers, since the large circles at these cities disappear in the unweighted graph.

To visualize the importance of different areas, we created a heatmap of the US states. Here we can see that California and New York are the most important customer states. This is probably strongly influenced by the big cities in these states. The distorting effect of population density must also be taken into account.

4.4.1.2 Product Data

For the product data, it has to be kept in mind that we visualize the orders, so the characteristics of more frequently bought products have a higher influence. Because of this, the following plots should be seen as popularity graphs of different product attributes.

The following stacked bar plot shows the stock types for the top brands and manufacturers. Most of them have a replenishable assortment. The biggest brands and manufacturers seem to have mixed stock types, which partly contain seasonal products.


Note: For the Lorenz curves we dropped the x-axis tick labels for better readability. For details, see the top attributes for products in the summary tables.

The Lorenz curve of the products shows that only a quarter of all products is responsible for about 75% of the ordered products. This indicates that some products are very popular. The curve is fairly smooth, which indicates that the product popularity decreases slowly and steadily over the ranking.

The manufacturer Lorenz curve shows a high share of orders for the biggest manufacturer, which leads to the assumption that the popularity of some manufacturers is even higher than the product popularity. Furthermore, the graph has a slight kink, showing that the nine most popular manufacturers account for the majority of the products sold in the online shop.

The brand Lorenz curve looks a little flatter, indicating a higher relevance of the manufacturer than of the brand. The curve has a slight kink at two points. The first can be explained by the high share of American Essentials in the orders (as we know from the summary tables). The second, an interesting anomaly, can be observed after the top 11 brands, because the share of the following brands decreases sharply after this point.

4.4.1.3 Payment Data

The stacked bar plot for credit card brands shows the share of premium and upscale cards for each brand. The card brand AMEX stands out due to a relatively high share of premium and upscale cards.

4.4.1.4 Order Process Data

For the order process data, some graphs referring to the order amount and price were created. It has to be mentioned again that the order data can distort the graphs: for example, customers may tend to buy the cheaper products, which would make the average product price seem lower.

The density curve for the discount amount shows 3 peaks: The first one has a medium height and is around a discount of 0%, the second one is around the 10% mark and is pretty large, whereas the last peak is at a discount of 50% and is rather low. This might indicate that customers buy rather targeted than randomly.

A density curve reflecting the order time is shown by the next plot. As to be expected the order amount goes down through the night. During the day the ordering is relative stable with some small peaks at 10 a.m. and in the afternoon. The activity in the afternoon can be explained by the average working hours, which mostly allow people only to spend time on online shopping in the afternoon and evening.

Next, there is displayed a graph, which summarizes some information on order behaviour. From this visulalization we can learn that the online shop sells products of a low price segments, but usually receives orders contaning a rather high amount (peak at 3-12) of articles. The different history plots on the right indicate a high activity in the first half of February, which drops in the beginning of the second month half and than slowly rises again. The first half could possibly be explained through customers buying presents for the valentines day (14th February).

4.4.2 Clickstream Data

4.4.2.1 Customer Data

The map clearly shows a very high browsing activity for the large US cities. The east coast, especially the area around New York, has a high share of online shop visitors.

The US states plot shows that there is a high activity in California and New York. In general, there is a lower activity in the states in the center.

4.4.2.2 Product Data

Note: For the Lorenz curves we dropped the x-axis tick labels for better readability. For details see the top attributes for products in the summary tables.








The Lorenz curve for the products is rather flat, which indicates that there are no big outliers in product popularity measured by views. Furthermore, the curve has a slight kink at about 60% of the products, which implies that a segment of about 40% of the products attracts very little interest.













The following graph appears a bit erratic, indicating that manufacturers can be divided into several popularity groups. Additionally, the curve is rather extended, which shows a highly unequal distribution of views across manufacturers. This implies that the manufacturer is highly important for the viewership.













In contrast, the brand Lorenz curve shows a more equally distributed popularity, because it is rather flat overall. At about 50% of the brands, a kink can be seen in the graph, which divides the brands into a rather popular half and a rather unpopular one.








4.4.2.3 Time Data

Especially interesting for analysis purposes is the time component that comes with the clickstream dataset. Each clickstream consists of timestamps and a number in a sequence.

The following plot shows that most of the clicks accumulate in the morning and then decline over the day. The extreme spike at around 2 a.m. can be disregarded, as it seems to stem from bot activity with extremely long session times and sequences.

The mentioned behavior is also visible in the overall click time series. Generally, there is more activity during the middle of April and less towards the end. Also, the seasonal component of the clickstream dataset is clearly visible in this kind of visualization.

While the average session length is about 8 minutes, the median is only about 2 minutes (see table). This is visible in the graph, where the peak is also at about 2 minutes, indicating the skewness of the time data. Most sessions last no longer than around 10 minutes, which leads to the assumption that products should be displayed prominently, as the customers do not spend much time searching for them. For a better visualization we left out sessions with a duration of over 40 minutes.

4.4.3 Comparison

For some of the graphs a direct comparison makes sense in order to visualize the differences between the orders and clicks dataset.

4.4.3.1 Customer Data

The maps clearly show the numerical difference between both datasets. Chicago and Dallas in particular stand out, because both cities have a very high number of viewers but rather few customers. One possible reason for this effect would be that the active bots originate from these two cities. The bot presence in Dallas could be explained by the high share of IT experts living there.

Orders Map

Clicks Map

4.4.3.2 Product Data

Note: For the Lorenz curves we dropped the x-axis tick labels for better readability. For details see the top attributes for products in the summary tables.

In the following, the product data Lorenz curves for both datasets are compared. In each case the order plot is displayed on the left and the click plot on the right. For all plots it can be observed that the curve for the order data is tilted more towards the upper left corner. Thus, it can be stated that the orders show a higher inequality in the popularity of product attributes. This means that people usually seem to pick the same products or brands, despite a bigger variance in the viewed products.

5 Recommendation Evaluation

In addition to analyzing the order and clickstream data, we analyzed the performance of different recommendation models. The performance of three different recommendation systems was measured:

  • A profit based recommendation system: This one recommends products that are meant to fit the taste of the subject, but also generate high revenue shares.
  • A ranking based recommendation system: This one recommends best performing products according to their sales rank.
  • A random recommendation system: This one was used as a baseline treatment.

The evaluation of the profit and ranking based recommendation systems was done using inference analysis, specifically using the computational paradigm instead of the mathematical one. One test was carried out for each of the two recommendation systems, with the null hypothesis always being that the system does not cause different sales than a purely random recommendation system. During each test we randomize our sample data 1000 times, using either permutation or bootstrapping, and measure the p-value and confidence interval. The test statistic we always use is the difference in mean sales between the group using the profit or ranking based recommendation system and the group using the random recommendation system. If the null hypothesis were true, the test statistic value for our sample would not differ significantly from the distribution of the test statistic for our randomized data. Our default alpha is 5%, but since we conduct a total of two tests, we have to apply the Bonferroni correction and adjust the alpha for each test to 2.5%.
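
The permutation variant of this procedure can be sketched as follows. This is a minimal sketch under stated assumptions: the function name is our own, the sales values are hypothetical, and the two-sided p-value is computed as the share of randomized samples whose difference in means is at least as extreme as the observed one.

```python
import random
from statistics import mean


def permutation_p_value(treatment, control, n_permutations=1000, seed=0):
    """Two-sided permutation test for the difference in group means."""
    rng = random.Random(seed)
    observed = mean(treatment) - mean(control)
    pooled = list(treatment) + list(control)
    n_treatment = len(treatment)
    extreme = 0
    for _ in range(n_permutations):
        # Under the null hypothesis the group labels are exchangeable,
        # so we reassign them at random and recompute the statistic.
        rng.shuffle(pooled)
        diff = mean(pooled[:n_treatment]) - mean(pooled[n_treatment:])
        if abs(diff) >= abs(observed):
            extreme += 1
    return observed, extreme / n_permutations


# Hypothetical sales in Euro per customer (illustration only).
profit_group = [20.00, 17.75, 21.75, 19.50, 22.25, 18.00]
random_group = [8.50, 8.50, 18.75, 17.75, 12.00, 10.25]

observed_diff, p_value = permutation_p_value(profit_group, random_group)
print(f"observed difference = {observed_diff:.2f}, p-value = {p_value:.3f}")
```

Fixing the seed makes a single run reproducible; with a different seed the p-value can change slightly, because only 1000 of all possible relabelings are drawn.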

Before diving into the inference analysis itself, we had to reformat our data for the recommendation systems into a shape that is suitable for inference analysis. We want to have a data frame in which one row equals one customer, who was exposed to a recommendation system. We use three columns:

  • Sales: The sales in Euro per customer.
  • Used_Profit_Oriented_recommendations: 1 if the customer was exposed to the profit oriented recommendation system, otherwise 0.
  • Used_Top_recommendations: 1 if the customer was exposed to the ranking based recommendation system, otherwise 0.

If the value in the columns Used_Top_recommendations and Used_Profit_Oriented_recommendations is 0, it means that the random recommendation system was used.

In the following table you can see the first 10 rows of the reformatted dataset:

Sales_in_EUR Used_Profit_Oriented_recommendations Used_Top_recommendations
8.50 0 0
20.00 1 0
16.00 0 1
8.50 0 0
17.75 1 0
19.75 0 1
18.75 0 0
21.75 1 0
19.75 0 1
17.75 0 0
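
The reshaping into this format can be sketched as follows. This is a minimal sketch, assuming that the raw data provides one (sales, system) pair per customer; the labels "profit", "ranking" and "random" are hypothetical, only the column names are taken from the table above.

```python
# Hypothetical labels for the three recommendation systems.
KNOWN_SYSTEMS = {"profit", "ranking", "random"}


def to_indicator_rows(records):
    """Turn (sales, system) pairs into rows with two 0/1 indicator columns."""
    rows = []
    for sales, system in records:
        if system not in KNOWN_SYSTEMS:
            raise ValueError(f"unknown recommendation system: {system!r}")
        rows.append({
            "Sales_in_EUR": sales,
            "Used_Profit_Oriented_recommendations": int(system == "profit"),
            "Used_Top_recommendations": int(system == "ranking"),
        })
    return rows


# The first three rows of the table above, expressed as (sales, system) pairs.
records = [(8.50, "random"), (20.00, "profit"), (16.00, "ranking")]
rows = to_indicator_rows(records)
for row in rows:
    print(row)
```

A customer of the random baseline group ends up with 0 in both indicator columns, so no third column is needed.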

5.1 Profit Oriented Recommendation

We executed an inference analysis for the profit oriented recommendation system. Firstly, we look at the p-value and the corresponding plot:

## [1] "p-value = 0"

The plot shows us the distribution of the test statistic for the 1000 randomized samples. The test statistic value for our sample is represented by a black line. The two-sided p-value regions are marked by a grey background. If our null hypothesis were true, the test statistic value of our sample would lie somewhere within the distribution of the test statistic for the randomized samples. Every test statistic value of a randomized sample that lies in the p-value region increases the p-value.

As we can see, the test statistic value of our sample is pretty far away from the test statistic values of the randomized samples. This already shows, without looking at the p-value itself, that the profit oriented recommendation system causes a significant difference in the sales in Euro. The reported p-value is 0, which means that none of the 1000 randomized samples produced a test statistic at least as extreme as the one of our sample (so p < 0.001). This reaffirms our interpretation of the plot.

Next, we look at the confidence interval:

2.5% 97.5%
3.320897 4.409748

As we can see, the 95% confidence interval suggests that if the shop starts using the profit oriented recommendation system for all customers, they would spend on average 3.32€ to 4.41€ more than with random recommendations.
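
An interval of this kind can be obtained with a percentile bootstrap, sketched below. Whether the interval above was produced in exactly this way is an assumption; the function name and the sales values are hypothetical.

```python
import random
from statistics import mean


def bootstrap_diff_ci(treatment, control, n_resamples=1000, level=0.95, seed=0):
    """Percentile bootstrap interval for the difference in group means."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_resamples):
        # Resample each group with replacement, keeping the group sizes.
        resampled_treatment = rng.choices(treatment, k=len(treatment))
        resampled_control = rng.choices(control, k=len(control))
        diffs.append(mean(resampled_treatment) - mean(resampled_control))
    diffs.sort()
    lower_index = round((1 - level) / 2 * n_resamples)
    upper_index = round((1 + level) / 2 * n_resamples) - 1
    return diffs[lower_index], diffs[upper_index]


# Hypothetical sales in Euro per customer (illustration only).
profit_group = [20.00, 17.75, 21.75, 19.50, 22.25, 18.00]
random_group = [8.50, 8.50, 18.75, 17.75, 12.00, 10.25]

lower, upper = bootstrap_diff_ci(profit_group, random_group)
print(f"95% interval for the difference in mean sales: {lower:.2f} to {upper:.2f}")
```

If the resulting interval does not contain 0, this points in the same direction as a small p-value of the corresponding test.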

5.2 Ranking Based Recommendation

The next step was to perform an inference analysis for the ranking based recommendation system. Firstly, we look at the p-value and the corresponding plot:

## [1] "p-value = 0.086"

In this plot some instances of the distribution of the test statistic for our random samples lie in the p-value zone. This is also reflected by the p-value of 0.086, which is greater than our Bonferroni-adjusted alpha of 0.025 (and also greater than the uncorrected alpha of 0.05). The effect of the ranking based recommendation system is therefore statistically insignificant.
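
The decision rule for both tests can be written down in a few lines. The following sketch uses the two p-values reported in this chapter; the function name is our own.

```python
def bonferroni_decisions(p_values, family_alpha=0.05):
    """Compare each p-value against alpha divided by the number of tests."""
    per_test_alpha = family_alpha / len(p_values)
    decisions = {name: p < per_test_alpha for name, p in p_values.items()}
    return per_test_alpha, decisions


# p-values as reported for the two tests in this chapter.
reported_p_values = {"profit oriented": 0.0, "ranking based": 0.086}
per_test_alpha, significant = bonferroni_decisions(reported_p_values)

print(f"alpha per test = {per_test_alpha}")
for name, is_significant in significant.items():
    print(f"{name}: {'significant' if is_significant else 'not significant'}")
```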

Let us look at the confidence interval:

2.5% 97.5%
-0.0073997 1.28289

Since the confidence interval includes the value 0, it shows us that the effect is statistically insignificant.

5.3 Implications of the Evaluation

To sum it up, the company should use the profit oriented recommendation system, since out of the two tested systems it causes the largest increase in revenue. The ranking based recommendation system does not cause any statistically significant difference in sales. However, if the measured outcome were something other than sales in Euro per person, the results could be different. The company should ask itself whether increasing revenue should really be its only goal for using recommendation systems. Maybe at some point in time the organization could introduce a subscription business model, similar to that of Amazon. In that case it might also be important to increase the percentage of customers that have a subscription.

6 Order Prediction

Following the inference analysis, our next task was to predict whether or not a person browsing the online shop will end up purchasing something, based on the browsing behaviour on the site. Since the vast majority of click sessions do not contain any customer information, due to unregistered users browsing the website or registered users not being logged in, we decided to omit that type of data for our prototypes. However, it is technically possible to include these attributes, which should be considered when developing a more advanced version of our prediction models.

6.1 Data Preparation

First of all, the relevant data had to be extracted or engineered from the clickstream data and was used to build a separate dataset, which represents the input data for a prediction model. The resulting data set contains one row per session. For each session the following attributes are used:

  • The number of clicks during a session
  • The session duration in seconds
  • The hour of day at the beginning of a session
  • The day of week at the beginning of a session
  • Whether the session contains an order (extracted by the URL “checkout/thankyou”)

Since the hour of day and the day of week have to be one-hot-encoded, it is only possible to show a subset of the data before this processing step:

Clicks Duration_in_Seconds REQUEST_DAY_OF_WEEK REQUEST_HOUR_OF_DAY Ordered
14 2028 Friday 23 No
8 272 Friday 23 No
2 39 Friday 23 No
16 91 Friday 23 No
15 79 Saturday 0 No
14 67 Saturday 0 No
16 68 Saturday 0 No
3 79 Saturday 0 No
7 189 Saturday 0 No
3 127 Saturday 0 No
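
The one-hot encoding step mentioned above can be sketched as follows. This is a minimal sketch using the first row of the table; the names of the resulting columns (for example "DAY_Friday" or "HOUR_23") are hypothetical.

```python
DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]
HOURS = range(24)


def one_hot(value, categories, prefix):
    """Return one 0/1 column per category, with a 1 for the given value."""
    if value not in categories:
        raise ValueError(f"unexpected value for {prefix}: {value!r}")
    return {f"{prefix}_{category}": int(category == value) for category in categories}


# First row of the table above.
session = {
    "Clicks": 14,
    "Duration_in_Seconds": 2028,
    "REQUEST_DAY_OF_WEEK": "Friday",
    "REQUEST_HOUR_OF_DAY": 23,
}

encoded = {
    "Clicks": session["Clicks"],
    "Duration_in_Seconds": session["Duration_in_Seconds"],
    **one_hot(session["REQUEST_DAY_OF_WEEK"], DAYS, "DAY"),
    **one_hot(session["REQUEST_HOUR_OF_DAY"], HOURS, "HOUR"),
}
print(len(encoded), "columns after encoding")
```

Two numeric columns plus 7 day columns and 24 hour columns result in 33 columns per session, which is why the encoded data is not shown as a table.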

However, an important thing to note is that about 99% of sessions do not contain an order. In addition to that, sessions consisting of only one click are ignored, since they offer no valuable information, because attributes like the session duration cannot be calculated.
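
The construction of one row per session can be sketched as follows. This is a minimal sketch, assuming that each click is available as a (session id, timestamp, URL) triple; this input layout and the example clicks are hypothetical, while the output columns and the URL marker follow the description above.

```python
from collections import defaultdict
from datetime import datetime

ORDER_URL_MARKER = "checkout/thankyou"
DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def build_session_rows(clicks):
    """Aggregate (session_id, timestamp, url) clicks into one row per session."""
    by_session = defaultdict(list)
    for session_id, timestamp, url in clicks:
        by_session[session_id].append((timestamp, url))

    rows = []
    for events in by_session.values():
        if len(events) < 2:
            continue  # single-click sessions have no measurable duration
        events.sort(key=lambda event: event[0])
        start, end = events[0][0], events[-1][0]
        ordered = any(ORDER_URL_MARKER in url for _, url in events)
        rows.append({
            "Clicks": len(events),
            "Duration_in_Seconds": int((end - start).total_seconds()),
            "REQUEST_DAY_OF_WEEK": DAY_NAMES[start.weekday()],
            "REQUEST_HOUR_OF_DAY": start.hour,
            "Ordered": "Yes" if ordered else "No",
        })
    return rows


# Hypothetical clicks: one two-click session with an order, one single click.
clicks = [
    ("s1", datetime(2000, 4, 14, 23, 50, 0), "/home"),
    ("s1", datetime(2000, 4, 14, 23, 54, 32), "/checkout/thankyou"),
    ("s2", datetime(2000, 4, 15, 0, 5, 0), "/home"),
]
session_rows = build_session_rows(clicks)
print(session_rows)
```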

6.2 Handling Bias and Variance

To handle model variance and bias we use a grid search over the model hyperparameters. This gives us the best bias-variance trade-off among the tested parameter combinations for each model, without having to inspect the source of variance or bias and fine-tune model parameters manually. This way we do not need bias or variance related plots, such as learning curves.

While our approach optimises the bias-variance trade-off for each model, it does not account for bias and variance differences between several models. The two types of models we use for our problem are decision trees and random forests. By design, a random forest averages many trees and therefore typically has a lower variance than a single decision tree, at a similar or slightly higher bias. This means that, depending on the performance of both models, we can see whether our main problem is too much bias or too much variance. For example, if random forests were to perform clearly better than decision trees, it would suggest that the main problem of the single tree is too much variance.

It should also be mentioned that, depending on the business goal, the ratio of positive and negative instances in the training set should be adjusted. We use a 1:2 ratio of positives and negatives. Increasing the number of negatives would cause a model to have an increased rate of true negatives and fewer false positives. However, it would also increase the rate of false negatives. Increasing the number of positive instances in the training set would have the reverse effect.
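
One way to obtain such a ratio is to keep all positive sessions and draw a random sample of the negative ones. The following is a minimal sketch of this idea; whether our training set was built in exactly this way is an assumption, and the function name and example sessions are hypothetical.

```python
import random


def sample_training_set(sessions, negatives_per_positive=2, seed=0):
    """Keep all positives and draw a random subset of the negatives."""
    rng = random.Random(seed)
    positives = [s for s in sessions if s["Ordered"] == "Yes"]
    negatives = [s for s in sessions if s["Ordered"] == "No"]
    n_negatives = min(len(negatives), negatives_per_positive * len(positives))
    training = positives + rng.sample(negatives, n_negatives)
    rng.shuffle(training)
    return training


# Hypothetical data with about 1% of sessions containing an order.
sessions = [{"Ordered": "Yes"}] * 10 + [{"Ordered": "No"}] * 990
training_set = sample_training_set(sessions)

n_yes = sum(s["Ordered"] == "Yes" for s in training_set)
n_no = sum(s["Ordered"] == "No" for s in training_set)
print(f"{n_yes} positives, {n_no} negatives")
```

Changing `negatives_per_positive` is the knob that shifts the model towards fewer false positives or fewer false negatives, as described above.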

6.3 Decision Trees

The first model of choice is a decision tree, because its decision-making process is easy to understand. Since it removes the need for dimensionality reduction and feature scaling, it also reduces the required amount of work. The evaluation metric used during the grid search is the ROC score. However, depending on the evolving needs of the company, one might want to choose a different evaluation metric.

Since the training set is created by using random samples, model performance can vary during each run. However, on average the following could be observed:

  • Accuracy ≈ 90%
  • F1 score ≈ 95%
  • ROC score ≈ 93%

Also, normally, the most important features are the number of clicks and the session duration.

The confusion matrix can be seen below. As you can see, the model correctly classifies around 95% of the sessions that contain an order. However, only around 86% of the sessions that do not contain an order are classified correctly. This brings up the following question: How should each classification type be weighted? To answer this the company should first ask itself: “How, if at all, do we want to target session users differently?” For example, one idea could be to offer discounts to visitors who have a high click rate during short sessions, as this could imply dissatisfaction with the price or quality of the visited products.
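
The two rates discussed above can be derived from the actual and predicted labels as sketched below. The function names and the labels are hypothetical and only serve as an illustration.

```python
def confusion_counts(actual, predicted, positive="Yes"):
    """Count true/false positives and negatives for two label lists."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for truth, guess in zip(actual, predicted):
        if truth == positive:
            counts["tp" if guess == positive else "fn"] += 1
        else:
            counts["fp" if guess == positive else "tn"] += 1
    return counts


def class_rates(counts):
    """Share of correctly classified order and non-order sessions."""
    order_rate = counts["tp"] / (counts["tp"] + counts["fn"])
    no_order_rate = counts["tn"] / (counts["tn"] + counts["fp"])
    return order_rate, no_order_rate


# Hypothetical labels for six sessions (illustration only).
actual = ["Yes", "Yes", "No", "No", "No", "Yes"]
predicted = ["Yes", "No", "No", "Yes", "No", "Yes"]

counts = confusion_counts(actual, predicted)
order_rate, no_order_rate = class_rates(counts)
print(counts)
print(f"orders correct: {order_rate:.0%}, non-orders correct: {no_order_rate:.0%}")
```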

Decision Tree - Confusion Matrix

As mentioned above, the main reasons for choosing a decision tree are the ease of understanding its logic and the automatic feature filtering. Below you can see a visualization of the decision tree. It has four layers. The only variables of interest are the session duration and the number of clicks during the session. For example, in this case whenever a session has more than 10.5 clicks and lasts longer than 289.5 seconds, the tree will predict that the session contains an order. On the other hand, if a session has 8.5 clicks or fewer, the tree will predict that it does not contain an order.
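
The two rules named above can be written as a small function. This sketch only covers the branches described in the text; for all other combinations it returns None, because the corresponding branches of the tree are not spelled out here. The function name is our own.

```python
def described_tree_rule(clicks, duration_seconds):
    """Apply only the two decision rules described in the text.

    Returns "Yes" or "No" where the text states the prediction and
    None for branches that are not described.
    """
    if clicks <= 8.5:
        return "No"
    if clicks > 10.5 and duration_seconds > 289.5:
        return "Yes"
    return None  # remaining branches are only visible in the tree plot


print(described_tree_rule(14, 2028))
print(described_tree_rule(8, 272))
print(described_tree_rule(16, 91))
```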

Decision Tree Visualized

6.4 Random Forest

The second model of choice is a random forest. A random forest is an ensemble model made up of several decision trees. Because it averages the predictions of many trees, a random forest typically has a lower variance than a single decision tree, at a similar or slightly higher bias. It also offers less insight into the logic of the model, since it is made up of several trees. But depending on the forest complexity, it is still possible to present an overview of that logic. For example, if the forest is made up of only 5 trees, each tree could be visualized. However, the more trees a forest is made of, the less transparency it offers.

Again, since the training set is created by using random samples, model performance can vary during each run. The following values could be observed on average:

  • Accuracy ≈ 88%
  • F1 score ≈ 93%
  • ROC score ≈ 90%

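This is a drop of roughly 2 to 3 percentage points compared to the previous decision tree model. Since a random forest mainly reduces variance compared to a single tree, the missing improvement suggests that variance is not the limiting factor here, which favors the simpler decision tree over the random forest.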

The confusion matrix can be seen below. The model correctly classifies around 98% of the sessions that contain an order. On the other hand, it correctly classifies only around 85% of the sessions that do not contain an order.

In this case it looks like the model simply trades a lower ratio of false negatives for a higher ratio of false positives, compared to the decision tree. But on average the combined ratio of false positives and false negatives is slightly higher than for the decision tree model.

Random forest - Confusion Matrix

6.5 Conclusion

First of all, it should be evaluated which correct and false predictions are important. For example, the company might want to offer discounts to potential customers who would normally not order something during their session. In that case the ratio of false negatives should be kept down, as the company does not want to offer discounts to too many people who would order a product anyway.

After this evaluation a fitting performance measurement metric can be chosen. Depending on the metric, the performance ranking of different models can change. In addition to that, the ratio of positive and negative instances used to train the model should be adjusted, as described in this chapter.

Another important question is how relevant model transparency is. If it is not required at all to understand how and why the model predicts, more models and modelling strategies should be considered. For example, neural networks would become an option, and methods for automatic dimensionality reduction, such as PCA, or ensemble methods, such as bagging or boosting, should also be evaluated.

In addition, it should be considered whether a continuous training of the model is a good idea. On the one hand, it offers the advantage of staying up to date by feeding the newest customer behaviour data to the model. On the other hand, it would require a higher investment in monitoring software and personnel as well as IT security. The reason is that continuous learning creates the risk of model corruption, which requires constant performance monitoring, regular model backups and a quick way to switch to an older version of the model.

7 Conclusion

Finally, we would like to summarise our most important findings and point out some recommendations for the shop.

The analysis of the given datasets, in the shape of summary tables as well as visualizations, points out some interesting aspects for business decisions of the company providing the online shop. For all of these results it has to be kept in mind that the compared datasets refer to different time periods and therefore the results might not be fully comparable. To address the most important discoveries: It could be observed that younger customer groups tend to browse more and buy less than the older ones. This might indicate a possible application area for special discounts on products aimed at a rather young customer base. The audience analysis results can be regarded as another important finding, because the comparison of both datasets showed that products for men are bought with comparably less browsing activity. This points towards a more goal-oriented customer base, which reduces the relevance of discounts on these products. Furthermore, it could be observed that significantly more seasonal products are viewed than bought. This probably hints at customer dissatisfaction regarding seasonal products. Hence, a revision of the price model for these product groups should be considered. From the Lorenz curve of the order manufacturers it can be seen that the top-ranked quarter of manufacturers accounts for a very high share of the orders. Thus, the shop could think about reducing the number of products from unpopular manufacturers.

The company should use the profit recommendation system for all customers to increase the revenue. At the same time it should be evaluated if the revenue is the only important metric for recommendation system performance. Other metrics could be the rate of non-customer to customer conversion or the customer subscription rate.

Furthermore, the application of a predictive model requires a cost-benefit matrix to be constructed, which in turn requires developing a strategy for dealing with buyers and non-buyers browsing the shop. One possible strategy is to offer discounts to browsing visitors who are not likely to order a product during their browsing session. For that example the predictive models we developed already offer a great performance, since over 99.95% of the predictions that a visitor will not buy something are correct. It should be noted, however, that it is hard to estimate how many of these people could be tilted towards buying a product after being offered a discount.
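
The share of correct "will not buy" predictions depends strongly on how rare orders are. The following sketch illustrates this with the approximate figures from Section 6: about 95% of order sessions and about 86% of non-order sessions classified correctly, and about 1% of sessions containing an order. As these inputs are rounded, the result only indicates the order of magnitude and does not reproduce the exact figure above; the function name is our own.

```python
def negative_predictive_value(order_rate_correct, no_order_rate_correct, order_share):
    """Share of 'no order' predictions that are actually correct."""
    true_negatives = no_order_rate_correct * (1 - order_share)
    false_negatives = (1 - order_rate_correct) * order_share
    return true_negatives / (true_negatives + false_negatives)


# Approximate values taken from Section 6 (rounded, illustration only).
npv = negative_predictive_value(
    order_rate_correct=0.95,
    no_order_rate_correct=0.86,
    order_share=0.01,
)
print(f"share of correct 'no order' predictions: {npv:.2%}")
```

Because only about one session in a hundred contains an order, even a model with a moderate rate of correctly classified non-order sessions reaches a very high share of correct "no order" predictions.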

During our analysis of the data we discovered traces pointing towards automated bots being active on the online shop. This is problematic, since it distorts the results of our analysis and the usefulness of input data for our predictive models. The company should consider investing some resources into spotting bot activity and its sources, to exclude data from bot activity from future analysis results as well as from input data for predictive models.

8 Appendix

8.1 Data Folder Structure

To get an overview of the content of the “0 Data” folder:

  • Each file has a suffix depending on what language was used for creating it. Files created with a Python script have the suffix “_P”, while files created with R have the suffix “_R”.
  • For both datasets three types of files are created:
    • A copy of the original dataset (e.g. “order_data_R.csv”)
    • A cleaned version of that dataset (e.g. “order_data_cleaned_R.csv”)
    • A smaller version of the cleaned data, which allows quick viewing and testing of the technical functionality of the coding (e.g. “order_data_small_R.csv”)

8.2 Full Clickstream Tables

The clickstream dataset contains a high quantity of information. Thus, not all attributes are displayed in the summary tables above, but can be seen in the following overview tables.

8.2.1 Customer Data

Variable Top.1st Top.2nd Top.3rd Top.4th Top.5th Others Not.Available
WhichDoYouWearMostFrequent casual socks: 34.62% hosiery: 31.14% athletic socks: 19.15% trouser socks: 15.09% 0% 98.98%
YourFavoriteLegcareBrand Conair: 15.4% Nature Made: 12.79% Epilady: 10.7% eShave: 7.31% Lucky Chick: 6.53% 47.27% 99.24%
Registration Gender Female: 52.17% Male: 47.83% 0% 99.95%
NumberOfChildren 0: 60.87% 1: 13.04% 2: 13.04% 4 or more: 13.04% 0.01% 99.95%
DoYouPurchaseForOthers False: 100% 0% 96.96%
HowDoYouDressForWork business casual: 41.57% very casual: 29.22% business dress: 15.29% comfortable / athletic: 13.92% 0% 98.99%
HowManyPairsDoYouPurchase 15 or more: 48.9% 11 to 15: 30.94% 1 to 5: 14.37% 6 to 10: 5.79% 0% 99.01%
YourFavoriteLegwearBrand Hanes: 39.14% DKNY: 17.22% Donna Karan: 8.61% Danskin: 8.22% Berkshire: 5.09% 21.72% 98.99%
WhoMakesPurchasesForYou spouse: 50.51% friend: 24.24% parent: 23.23% siblings: 2.02% 0% 99.8%
NumberOfAdults 2: 43.48% 3 or more: 34.78% 1: 21.74% 0% 99.95%
HowDidYouHearAboutUs other: 38.32% friend / family: 31.99% e-mail: 18.15% print ad: 9.23% direct mail: 1.19% 1.12% 97.34%
SendEmail True: 65.92% False: 34.08% 0% 96.92%
HowOftenDoYouPurchase every 6 months: 75.84% once a year: 16.62% each week: 7.53% 0.01% 99.24%
HowDidYouFindUs Friend/Co-worker: 69.57% Other: 26.09% News Story: 4.35% 0% 99.95%
City San Francisco: 2.31% New York: 2.18% Chicago: 1.35% Stamford: 1.09% Dallas: 0.71% 92.36% 96.92%
US State CA: 13.28% NY: 11.1% TX: 5.97% PA: 5.39% IL: 4.68% 59.58% 96.92%
Email COM: 72.1% NET: 19.69% Gazelle: 2.95% EDU: 2.57% Other: 2.12% 0.57% 96.92%
Truck Owner False: 77.84% True: 22.16% 0% 97.72%
RV Owner False: 91.17% True: 8.83% 0% 97.72%
Motorcycle Owner False: 98.79% True: 1.21% 0% 97.72%
Marital Status Married: 61.8% Single: 24.7% Inferred Married: 7% Inferred Single: 6.5% 0% 98.02%
Working Woman False: 65.54% True: 34.46% 0% 97.72%
Mail Responder True: 76.54% False: 23.46% 0% 97.72%
Bank Card Holder True: 83.55% False: 16.45% 0% 97.72%
Gas Card Holder True: 72.73% False: 27.27% 0% 97.72%
Upscale Card Holder True: 51.08% False: 48.92% 0% 97.72%
Unknown Card Type False: 60.17% True: 39.83% 0% 97.72%
TE Card Holder False: 91% True: 9% 0% 97.72%
Premium Card Holder False: 78.96% True: 21.04% 0% 97.72%
Presence Of Children False: 50.39% True: 49.61% 0% 97.72%
Estimated Income Code $50;000-$74;999: 23.17% $75;000-$99;999: 17.74% $40;000-$49;999: 11.41% $30;000-$39;999: 11.23% $125;000 OR MORE: 10.34% 26.11% 97.78%
Home Market Value $75;000-$99;999: 16.61% $50;000-$74;999: 15.09% $100;000-$124;999: 13.68% $125;000-$149;999: 9.12% $150;000-$174;999: 7.49% 38.01% 98.31%
New Car Buyer True: 100% 0% 98.97%
Vehicle Lifestyle IMPORT (STANDARD/ECONOMY): 27.76% FULL SIZE (STANDARD/LUXURY): 22.64% TRUCK OR UTILITY VEHICLE: 12.6% SPECIALTY (MIDSIZE/SMALL): 11.42% STATION WAGON: 10.83% 14.75% 99%
Property Type single family dwelling: 86.59% condo: 7.53% 2-4 unit(duplex;triplex;quad): 2.35% misc. residential (condo store/flat): 1.88% apartment(5+ units): 0.94% 0.71% 99.16%
Loan To Value Percent 0% (NO LOANS): 30.26% 100-99%: 10.53% 70-74%: 8.88% 75-79%: 8.88% 80-84%: 8.88% 32.57% 99.4%
Presence Of Pool False: 98.87% True: 1.13% 0% 97.72%
Own Or Rent Home Owner: 93.56% Renter: 6.44% 0% 97.97%
Mail Order Buyer True: 64.68% False: 35.32% 0% 97.72%
DMA No Mail Solicitation Flag True: 100% 0% 97.72%
DMA No Phone Solicitation Flag True: 100% 0% 97.72%
New Bank Card False: 100% 0% 97.72%
Speciality Store Retail False: 84.59% True: 15.41% 0% 97.72%
Oil Retail Activity False: 90.65% True: 9.35% 0% 97.72%
Bank Retail Activity False: 77.58% True: 22.42% 0% 97.72%
Finance Retail Activity False: 90.3% True: 9.7% 0% 97.72%
Miscellaneous Retail Activity False: 94.37% True: 5.63% 0% 97.72%
Upscale Retail False: 94.37% True: 5.63% 0% 97.72%
Upscale Speciality Retail False: 96.36% True: 3.64% 0% 97.72%
Retail Activity False: 64.33% True: 35.67% 0% 97.72%
Dwelling Size SINGLE HOUSEHOLD: 75.58% 2 HOUSEHOLDS: 6.78% 100+ HOUSEHOLDS: 3.53% 3 HOUSEHOLDS: 2.41% 10-19 HOUSEHOLDS: 1.95% 9.75% 97.87%
Lendable Home Equity EQUITY LESS THAN OR EQUAL $0: 33.22% EQUITY $10;000-$19;9999: 9.54% EQUITY $1-$4;999: 8.55% EQUITY $75;000-$99;999: 8.55% EQUITY $100;000-$149;999: 8.22% 31.92% 99.4%
Home Size Range 1;250-1;499 FT: 16.04% 2;000-2;499 FT: 15.41% 1;000-1;249 FT: 14.47% 1;500-1;749 FT: 13.84% 1;750-1;999 FT: 10.69% 29.55% 99.37%
Lot Size Range 1 ACRE OR LESS: 89.58% GREATER THAN 1 ACRE: 10.42% 0% 99.43%
Dwelling Unit Size SINGLE FAMILY DWELLING UNIT: 74.03% MULTI FAMILY DWELLING UNIT: 25.97% 0% 97.81%
Available Home Equity EQUITY $30;000-$49;000: 19.67% EQUITY $50;000-$74;000: 18.56% EQUITY $75;000-$99;999: 11.6% EQUITY $20;000-$29;000: 11.27% EQUITY $100;000-$149;999: 11.05% 27.85% 98.21%
Minority Census Tract False: 98.61% True: 1.39% 0% 97.72%
Gender Female: 83.33% Male: 16.67% 0% 98.43%
Occupation PROFESSIONAL/TECHNICAL: 32.33% HOUSEWIFE: 15.54% ADMINISTRATIVE/MANAGERIAL: 14.29% CLERICAL/WHITE COLLAR: 11.28% STUDENT: 6.02% 20.54% 99.21%
Other Indiv Gender Male: 81.91% Female: 18.09% 0% 98.84%
Other Indiv Occupation PROFESSIONAL/TECHNICAL: 45.9% ADMINISTRATIVE/MANAGERIAL: 17.93% CRAFTSMAN/BLUE COLLAR: 13.68% SALES/SERVICE: 6.38% CLERICAL/WHITE COLLAR: 4.86% 11.25% 99.35%

Variable Max Mean Median Min SD
Year of Birth 1979 1966.58 1966 1948 7.95
Value Of All Vehicles 99000 19111.56 16000 1000 14190.98
Age 86 37.58 36 18 10.71
Other Indiv Age 86 40.90 38 18 11.51
Number Of Adults 6 2.54 2 1 1.35
Year House Was Built 1997 1963.50 1968 1850 27.02
Length Of Residence 15 6.77 6 0 4.39
Year Home Was Bought 1999 1991.18 1993 1954 6.34
Home Purchase Date 199906 199121.06 199300 195400 633.82
Number Of Vehicles 3 1.47 1 1 0.61
CRA Income Classification 4 3.31 3 1 0.63
Number Of Credit Lines 9 2.72 3 1 1.61
Dataquick Market Code 10 4.61 4 1 2.53
Insurance Expiry Month 12 6.54 6 1 3.45
Month Home Was Bought 12 7.03 7 1 3.42
Year Of Structure 1999 1973.29 1980 1900 26.47

8.2.2 Product Data

Variable Top.1st Top.2nd Top.3rd Top.4th Top.5th Others Not.Available
BrandName DKNY: 9.99% Silk Reflections: 9.21% ORO: 8.85% HPK: 7.22% AME: 7.13% 57.6% 86.16%
PrimaryPackage Bottle: 35.18% Tube: 30.46% Jar: 25.55% Box: 5.4% Spray: 3.41% 0% 96.07%
StockType Replenishable: 60.93% Seasonal 1: 23.75% Seasonal 1*: 11.25% Seasonal 2: 2.07% Replenishment: 1.83% 0.17% 80.3%
ProductForm Cream: 54.01% Liquid: 24.19% gel: 7.58% Lotion: 7.36% Capsule: 5.97% 0.89% 96.64%
Look Sheer: 83.41% Ultra Sheer: 13.1% Opaque: 3.48% 0.01% 94.22%
BasicOrFashion Basic: 92.33% Fashion: 7.67% 0% 86.11%
MfgStyleCode Tricot: 2.34% BC27340: 1.45% 00N02: 1.35% 00Q63: 1.34% 5751: 1.2% 92.32% 82.06%
SaleOrNonSale NSALE: 100% 0% 94.46%
HasDressingRoom False: 73.56% True: 26.44% 0% 86.09%
ColorOrScent Scent: 85.69% Color: 14.31% 0% 99.7%
Texture Flat: 66.48% Textured: 33.52% 0% 96.65%
Manufacturer Donna Karan Company: 10.79% Peneco: 9.08% HAN: 8.66% Kneipp: 6.61% Paul Lavitt Mills Inc.: 6.54% 58.32% 80.12%
ToeFeature SF: 86.04% RT: 13.96% 0% 93.49%
Category2 Gift Sets & Special Items: 32.01% Skincare: 23.28% Cellulite & Other Treatments: 22.82% Footcare: 14.47% Health Supplements: 6.2% 1.22% 99.21%
Material Cotton: 66.35% Nylon: 23.62% Coolmax: 3.71% Rayon: 1.49% Lycra: 1.04% 3.79% 93.18%
CategoryCode PH: 33.89% WDCS: 12.66% TH: 7.83% FO: 6.76% TT: 6.2% 32.66% 86.09%
WaistControl CT: 76.17% STW: 23.83% 0% 94.34%
Collection Oroblu Italian Hosiery: 6.31% Conversationals: 5.17% DKNY Skin: 4.41% Action Pack 3-Pair: 3.9% Womens Dance: 3.84% 76.37% 86.96%
BodyFeature MBC: 64.5% UBC: 17.51% LBC: 11.17% BS: 6.82% 0% 98.49%
Audience Women: 80.86% Men: 10.21% Children: 8.93% 0% 86.09%
Category1 Skincare: 60.62% Footcare: 18.05% Cellulite & Other Treatments: 15% Hair Removal: 3.89% Health Supplements: 1.83% 0.61% 94%
Product Cellulite Trimming Gel: 3.25% Body Lotion - Oceanic Minerals: 2.74% Kit-Firming Cream/Slimming Cream/Shorts: 2.57% Body Silk: 2.41% Herbal Foot Balm: 2.31% 86.72% 94%
Pattern Solid: 58% Conversational: 39.12% Floral: 2.03% Stripe: 0.55% Herringbone: 0.18% 0.12% 93.72%

Variable Max Mean Median Min SD
UnitsPerInnerBox 12.0 4.44 3.00 1.00 2.91
Depth 16.0 2.73 2.50 0.50 2.15
VendorMinREOrderDollars 500.0 161.15 150.00 100.00 82.29
Height 8.5 1.18 0.75 0.25 1.20
UnitsPerOuterBox 144.0 18.57 12.00 4.00 16.83
Pack 3.0 1.14 1.00 1.00 0.50
Length 16.5 9.12 9.25 3.50 1.65
MinQty 144.0 16.48 6.00 0.00 30.45
LeadTime 28.0 10.95 10.00 1.00 6.41
Weight 40.0 4.98 2.60 0.40 5.59
Width 18.0 5.68 6.25 0.50 1.93
UnitIncrement 36.0 4.81 3.00 1.00 4.38